Intake-ESM Integration based on #1218 by charles-turner-1 · Pull Request #2690 · ESMValGroup/ESMValCore

charles-turner-1 · 2025-03-13T02:03:36Z

Description

Add intake-dataset class to load datasets via intake.
Update config-developer.yml to include intake datasets.

TODO:

Our intake catalogs here on Gadi have a bunch of extra keys (facets) that I haven't mapped. Is there any documentation on where to find all potential facets that ESMValCore might accept & what they represent? I've been struggling to find them.
Tests - presumably the obvious place to stick these is in tests/unit/test_dataset.py, or is it preferable to add a new test module? I'll hold off writing these until I work out the facets issue.
Structure: I've put this in an intake submodule, but I could move it into dataset if that's preferable? Also affects previous point.

Have requested a review but obviously this is nowhere near ready to go on the infrastructure side wrt. tests, etc. A couple pointers in the right direction and that stuff should fly along.

Closes #31

Link to documentation:

Before you get started

☝ Create an issue to discuss what you are going to do

Checklist

It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.

🧪 The new functionality is relevant and scientifically sound
🛠 This pull request has a descriptive title and labels
🛠 Code is written according to the code quality guidelines
🧪 and 🛠 Documentation is available
🛠 Unit tests have been added
🛠 Changes are backward compatible
🛠 Any changed dependencies have been added or removed correctly
🛠 The list of authors is up to date
🛠 All checks below this pull request were successful

To help with the number of pull requests:

🙏 We kindly ask you to review two other open pull requests in this repository


codecov · 2025-03-13T06:12:34Z

Codecov Report

❌ Patch coverage is 94.56522% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 95.61%. Comparing base (2792ad1) to head (27aa007).

Files with missing lines	Patch %	Lines
esmvalcore/io/intake_esm.py	94.56%	5 Missing ⚠️

❌ Your patch check has failed because the patch coverage (94.56%) is below the target coverage (100.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2690      +/-   ##
==========================================
- Coverage   95.62%   95.61%   -0.01%     
==========================================
  Files         266      267       +1     
  Lines       15601    15693      +92     
==========================================
+ Hits        14918    15005      +87     
- Misses        683      688       +5


bouweandela

Great to see progress on this @charles-turner-1!

bouweandela · 2025-03-21T07:52:13Z

esmvalcore/config-developer.yml

    SYNDA: '{activity}/{institute}/{dataset}/{exp}/{ensemble}/{mip}/{short_name}/{grid}/{version}'
    NCI: '{activity}/{institute}/{dataset}/{exp}/{ensemble}/{mip}/{short_name}/{grid}/{version}'
  input_file: '{short_name}_{mip}_{dataset}_{exp}_{ensemble}_{grid}*.nc'
+  catalogs:


The plan was to not further extend config-developer, but rather move this to the new configuration that lives in ~/.config/esmvaltool. See #2371 for an example of what we thought the configuration should look like.

bouweandela · 2025-03-21T07:56:45Z

esmvalcore/config-developer.yml

+        - /g/data/oi10/catalog/v2/esm/catalog.json
+      facets:
+        # mapping from recipe facets to intake-esm catalog facets
+        # TODO: Fix these when Gadi is back up


You could also test on DKRZ Levante, the intake catalogs are located at /pool/data/Catalogs/dkrz_cmip6_disk.json
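For anyone wanting to try the Levante catalog mentioned here, a minimal sketch of opening and searching it with intake-esm might look like the following. `search_catalog` is a hypothetical helper, not part of this PR, and the column names in the commented call (`source_id`, `variable_id`) are assumptions based on the CMIP6 vocabulary; check the catalog's own columns before relying on them.

```python
def search_catalog(catalog_path: str, **query: str):
    """Open an intake-esm datastore and return the matching sub-catalog.

    Hypothetical helper for illustration only.
    """
    # Imported lazily so the module can be loaded without intake-esm installed.
    import intake

    catalog = intake.open_esm_datastore(catalog_path)
    return catalog.search(**query)


# On Levante this might be called as follows (not executed here, since the
# path only exists on that system and the column names are assumptions):
# subset = search_catalog(
#     "/pool/data/Catalogs/dkrz_cmip6_disk.json",
#     source_id="MPI-ESM1-2-LR",
#     variable_id="tas",
# )
# print(subset.df.head())
```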

bouweandela · 2025-03-21T08:04:02Z

esmvalcore/intake/_dataset.py

+    return ([_CACHE[cat_url] for cat_url in catalog_urls], facet_list)
+
+
+class IntakeDataset(Dataset):


I'm having some reservations about subclassing the Dataset class for this purpose:

A typical use case for many of our users will be that they have most data available from a central catalog that is managed by a central administrator, but want to augment that with the ability to download some files themselves. In that case, it is really useful to have the ability to deduplicate (e.g. pick the latest version of a file). I'm not sure if this can be achieved by subclassing the Dataset object.

We will likely want to add support for other catalogs as well, e.g. intake-esgf, xcube, and STAC. If we need a new Dataset class for each of these, it may become confusing to users.

How will this work from the recipe?

As an alternative, would it be an option to load the available data sources from the configuration / Dataset.session and then make the Dataset.files method loop over the available sources and deduplicate input files?
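A rough sketch of that alternative, assuming every configured data source exposes a `find_data` method and that "latest version" can be decided by comparing version strings. `FoundFile`, `DataSource`, `StaticSource` and `find_files` are hypothetical names used for illustration; none of them are existing ESMValCore classes.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class FoundFile:
    """Hypothetical record describing one input file."""

    name: str  # file name, used as the deduplication key
    version: str  # e.g. "v20200101"; compared as a plain string
    source: str  # which data source reported the file


class DataSource(Protocol):
    """Hypothetical interface that every data source would implement."""

    def find_data(self, **facets: str) -> list[FoundFile]: ...


class StaticSource:
    """Toy in-memory source standing in for a catalog or a download service."""

    def __init__(self, files: list[FoundFile]) -> None:
        self._files = files

    def find_data(self, **facets: str) -> list[FoundFile]:
        # A real source would filter on the facets; this toy one ignores them.
        return list(self._files)


def find_files(sources: list[DataSource], **facets: str) -> list[FoundFile]:
    """Query every source and keep only the latest version of each file."""
    latest: dict[str, FoundFile] = {}
    for source in sources:
        for found in source.find_data(**facets):
            current = latest.get(found.name)
            if current is None or found.version > current.version:
                latest[found.name] = found
    # Sort by name so the result does not depend on the order of the sources.
    return sorted(latest.values(), key=lambda f: f.name)


catalog = StaticSource([FoundFile("tas_Amon.nc", "v20190101", "catalog")])
download = StaticSource(
    [
        FoundFile("tas_Amon.nc", "v20200101", "download"),
        FoundFile("pr_Amon.nc", "v20190101", "download"),
    ]
)
files = find_files([catalog, download], short_name="tas")
```

With a single `Dataset` class looping over sources like this, adding another catalog type would mean adding another source rather than another `Dataset` subclass.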

bouweandela · 2025-03-21T17:59:14Z

Is there any documentation on where to find all potential facets that ESMValCore might accept & what they represent?

ESMValCore is quite flexible with what facets it accepts. We have a translation between some of 'our' facets and the official ones in the esmvalcore.esgf.facets module (this is the subset that we use to search for files on ESGF). A few facets are used by ESMValCore for specific purposes such as CMOR checks and fixes (off the top of my head that would be dataset, project, mip, short_name), but others are entirely free-form and only used for finding input files and defining the output file names using the paths described in the config-developer.yml file.
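To illustrate the kind of translation described here, a small sketch of renaming recipe facets to catalog column names. The `FACET_MAP` entries and the `to_catalog_query` helper are assumptions for illustration, loosely modelled on CMIP6 naming; they are not copied from `esmvalcore.esgf.facets` or from any particular catalog.

```python
# Hypothetical mapping from recipe facet names to catalog column names.
FACET_MAP = {
    "dataset": "source_id",
    "exp": "experiment_id",
    "ensemble": "member_id",
    "mip": "table_id",
    "short_name": "variable_id",
}


def to_catalog_query(recipe_facets: dict, facet_map: dict = FACET_MAP) -> dict:
    """Rename recipe facets to catalog facets, dropping unmapped ones."""
    return {
        facet_map[name]: value
        for name, value in recipe_facets.items()
        if name in facet_map
    }


query = to_catalog_query(
    {
        "project": "CMIP6",  # no mapping defined above, so it is dropped
        "dataset": "ACCESS-ESM1-5",
        "mip": "Amon",
        "short_name": "tas",
    }
)
```

Free-form facets that the catalog does not know about simply fall out of the query, which matches the idea that only a handful of facets carry special meaning.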

Our intake catalogs here on Gadi have a bunch of extra keys (facets) that I haven't mapped.

If these are completely determined by the other facets, you can add them automatically using the extra facets facility

bouweandela · 2025-03-21T18:03:35Z

Structure: I've put this in an intake submodule,

How about adding a new module called e.g. esmvalcore.data or esmvalcore.data_sources or something similar and adding it as a submodule there? We could also move the esmvalcore.local and esmvalcore.esgf modules there (does not have to be in this pull request). I foresee us adding multiple input data sources in the near future.

charles-turner-1 · 2025-03-22T00:44:49Z

Thanks for the review Bouwe, super helpful! I've only had a skim so far, but I'll get those suggestions incorporated next week


bouweandela · 2025-07-03T15:20:09Z

I started working on adding some interface code that could be useful here too in #2765.

charles-turner-1 · 2025-07-06T08:11:42Z

Cheers, I'll take a look when I get the chance! Gonna talk to Martin Durant (author of Intake) in ~10 days, so hopefully this PR should pick up some steam after then; I'll be working on this stuff more actively.

bouweandela · 2025-11-26T07:02:26Z

This should be a lot easier now that #2765 has landed. You could take the esmvalcore.io.intake_esgf module as an example and add a configuration file similar to data-intake-esgf.yml.

valeriupredoi · 2025-12-04T15:07:44Z

@charles-turner-1 I popped the latest main here, that includes #2765 - do you reckon you'll have time to restart the work on it soon, mate? If not, no biggie, just pls let @bouweandela and myself know - we can take it from here, there is a bit of a tight schedule on getting full Zarr support (not only as a simple load via esmvalcore IO), and I reckon this is superuseful towards that 🍻

charles-turner-1 · 2025-12-05T07:26:28Z

Been hoping to get back to this for a while... I just keep managing to find more urgent stuff to get in the way. Me & @rbeucher will be in Canberra together next week, so hopefully we can get a handle on our priorities then.

charles-turner-1 · 2026-01-20T05:39:00Z

Just a heads up that I'm getting back to this now - will reach out if I have any issues!


charles-turner-1 · 2026-02-05T02:59:20Z

Seems like the ordering of IntakeEsmDataSource.find_data is not stable, tests failing in CI (different outcome between runs...) are passing locally. (I'm working on this right now).
Test sample data is not being found either, which is not something I expected to go awry. We can happily just move the sample data out to an object storage service or use a dataset already there if we like? I'll fix it up with local data & then think about that later.
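A minimal sketch of one way to make the ordering deterministic, assuming the search results arrive as a mapping keyed by strings whose iteration order can differ between runs. `in_stable_order` and the example keys are hypothetical and do not reflect the actual `IntakeEsmDataSource.find_data` implementation.

```python
def in_stable_order(found: dict[str, str]) -> list[tuple[str, str]]:
    """Return (key, value) pairs sorted by key, ignoring insertion order."""
    return [(key, found[key]) for key in sorted(found)]


# Two runs that discover the same datasets in a different order.
run_a = {"CMIP6.tas.Amon": "tas.nc", "CMIP6.pr.Amon": "pr.nc"}
run_b = {"CMIP6.pr.Amon": "pr.nc", "CMIP6.tas.Amon": "tas.nc"}

ordered_a = in_stable_order(run_a)
ordered_b = in_stable_order(run_b)
```

Sorting on the key keeps test expectations identical locally and in CI, at the cost of discarding whatever order the catalog returned.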


charles-turner-1 · 2026-02-05T06:59:20Z

Few lines of coverage to fix, but I think this is mostly ready for review now!

charles-turner-1 added 10 commits February 4, 2025 10:02

Add recognised intake-esm datastores on NCI systems to config_developer.yml, skeleton of intake-esm inclusiion following #1218

c129966

Skeleton

b1b76fb

Playing around

dd73d1d

Almost at a working IntakeDataset.load()

ed1676b

Working intake-esm implementation - probably still some kinks to iron out

fa1ea2e

Working with multiple catalogues per project

648f119

Cleanup - mypy & ruff errors

2b91fec

Remove WIP

c7b8ffb

Update depenencies & dev environment

31b35cb

Pre-commit modifications

a8532a5

charles-turner-1 requested a review from bouweandela March 13, 2025 02:03

charles-turner-1 added 3 commits March 13, 2025 11:45

Merge branch 'main' into intake-esm

7e56959

Fixed most of codacy (mypy-strict?) gripes

568cb8d

Fix typo

91fee56

charles-turner-1 requested a review from bettina-gier March 17, 2025 23:31

bouweandela reviewed Mar 21, 2025


charles-turner-1 added 2 commits April 2, 2025 13:19

Beginning to work on Bouwe's comments (WIP)

9d894b9

Updates - restructured esmvalcore/data/intake following Bouwe's suggestions

59d0d02

charles-turner-1 mentioned this pull request May 2, 2025

Facilitate creation of ESMValCore Dataset objects from datastores intake/intake-esm#715

Open

charles-turner-1 added 4 commits May 9, 2025 11:58

Reorder imports (ruff maybe?)

2050081

Add _read_facets to intake configuration: see intake/intake-esm#717

59e4205

Add merge_intake_seach_history function (see intake/intake-esm@73f150e)

2527059

Merge branch 'main' into intake-esm

4641965

charles-turner-1 mentioned this pull request May 13, 2025

Add optional esmvalcore dependency & to_esmvalcore method intake/intake-esm#717

Draft

8 tasks

Merge branch 'main' into intake-esm

1b26148

readd intake

b77d194

Merge branch 'main' into intake-esm

e131cfc

charles-turner-1 and others added 11 commits January 22, 2026 10:46

Add data.io.intake_esm.py, scaffold off data.io.intake_esgf.py`

a53e140

WIP

b84cf70

Scaffold tests

ef6fdba

Remove /data/intake stuff, /config/_intake

f1b8f55

Pre-commit

ec17bfa

Merge branch 'main' into intake-esm

81fc7dc

Nearly there I think - all tests passing. Hopefully CI can tell me what I've missed 😅

5417f6c

Pre-commit

d988b17

Remove old intake-esm file

8533644

Merge branch 'main' into intake-esm

1c41a35

Merge branch 'main' into intake-esm

15a79f5

charles-turner-1 added 4 commits February 5, 2026 14:07

Sort keys when finding data - should guarantee order stability

b7ceeea

Change path import style to match `/tests/integration/preprocessor/_io/_test_zarr` - might fix discovery?

6dd736f

Revert ugly type ignore stuff

5403ac8

Un-ignore the intake-esm data ncfiles

27aa007

charles-turner-1 marked this pull request as ready for review February 5, 2026 06:49



3 participants
